Design Requirements

The Philosophy of System Design

Concepts

  • Trade-offs: There is rarely a single correct answer or perfect solution in system design
  • Analysis: It is not about memorizing facts, but about analyzing trade-offs to decide which option is better for a specific problem
  • Goal: The objective is to design effective systems to solve very large problems

The Three Core Data Operations

Context: At a high level, no matter how complex a system is, it boils down to three actions

Moving Data

Concepts

  • Data moves between components (RAM to CPU, CPU to Disk) or between machines across the world
  • Moving data across networks and different data centers adds complexity compared to local movement

Storing Data

Concepts

  • Data can be stored in volatile memory (RAM) or persistent storage (Disk)
  • Different storage methods (Databases, Blob stores, File systems) have different trade-offs
  • Critical warning: Bad design choices in storage are very hard to correct later. Migrating data is much harder than refactoring code

Transforming Data

Concept: This involves taking input data and manipulating it to get useful output

Example: Aggregating raw server logs to calculate the percentage of successful vs. failed requests
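
A minimal sketch of this kind of transformation, assuming (hypothetically) that each raw log line ends with an HTTP status code:

```python
# Toy aggregation: turn raw log lines into a success percentage.
# Assumes each line ends with an HTTP status code, e.g. "GET /home 200".
def success_rate(log_lines):
    total = failed = 0
    for line in log_lines:
        status = int(line.rsplit(maxsplit=1)[-1])  # last token is the status
        total += 1
        if status >= 400:  # 4xx/5xx count as failed requests
            failed += 1
    return 100 * (total - failed) / total if total else 0.0

logs = ["GET /home 200", "GET /cart 500", "POST /login 200", "GET /img 404"]
print(f"{success_rate(logs):.1f}% successful")  # 50.0% successful
```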

Availability

Concept: The percentage of time a service is functioning and able to respond to requests

Formula: Availability = uptime / (uptime + downtime)
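
The same formula as a quick sketch in code (the numbers are illustrative):

```python
# Availability = uptime / (uptime + downtime), shown as a percentage.
def availability(uptime_hours, downtime_hours):
    return 100 * uptime_hours / (uptime_hours + downtime_hours)

# Illustrative: a year is 8,760 hours; ~87.6 hours of downtime -> 99%.
print(f"{availability(8760 - 87.6, 87.6):.1f}%")  # 99.0%
```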

The Nines

Concepts

  • Availability is measured in nines
  • 99% (Two Nines): Down for about 3.65 days per year. (Not very good for major businesses)
  • 99.999% (Five Nines): Down for only about 5 minutes per year. (Very solid)
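
These downtime figures fall out of simple arithmetic, sketched below:

```python
# Allowed downtime per year for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in (2, 3, 4, 5):
    downtime_fraction = 10 ** -nines            # e.g. two nines -> 1% down
    downtime_minutes = MINUTES_PER_YEAR * downtime_fraction
    print(f"{nines} nines: ~{downtime_minutes:,.1f} minutes/year down")
# 2 nines: ~5,256.0 minutes/year down (~3.65 days)
# 5 nines: ~5.3 minutes/year down
```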

SLO vs. SLA

Concepts

  • SLO (Service Level Objective): The internal goal developers set for the system (e.g., We want 99.999% uptime)
  • SLA (Service Level Agreement): A contractual agreement with the customer. If the promised level of service is not met, there are consequences, such as a partial refund
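
One hypothetical way to relate the two (the thresholds below are made up for illustration):

```python
# Hypothetical thresholds: the SLO is our internal target, the SLA is
# the weaker promise actually made to customers.
SLO = 99.999  # percent, internal objective (five nines)
SLA = 99.9    # percent, contractual floor

measured = 99.95  # example availability over a billing period
if measured < SLA:
    print("SLA breached: consequences apply, e.g. a partial refund")
elif measured < SLO:
    print("SLO missed internally, but the SLA still holds")
else:
    print("Meeting both the SLO and the SLA")
```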

Reliability, Redundancy, and Fault Tolerance

Concepts: These terms are often used interchangeably but have specific meanings regarding system failure

  • Reliability: The probability that the system will not fail
    • Adding a second server increases reliability because if one fails, the other still works (see the sketch after this list)
  • Fault tolerance: The ability of a system to continue functioning even if a part of it fails
    • Redundancy: Having duplicate copies of components that are not strictly needed for normal operation (like a second server running the exact same code)
    • You don't need both copies to handle normal traffic; the spare exists to take over in case of failure
  • Single point of failure: A system with only one server is a single point of failure. If it crashes, the whole system dies
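
A back-of-the-envelope sketch of why the second server helps, assuming failures are independent:

```python
# With independent failures, the system is only down when every copy
# is down at the same time: failure_prob ** replicas.
def system_availability(server_availability, replicas):
    failure = 1 - server_availability
    return 1 - failure ** replicas

print(f"{system_availability(0.99, 1):.2%}")  # 99.00% (single point of failure)
print(f"{system_availability(0.99, 2):.2%}")  # 99.99% (one redundant copy)
```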

Throughput

Concepts: Throughput measures how much work a system can handle over a period of time. It is measured in three main ways

  • Requests Per Second (RPS): Used for servers. How many user requests can be handled per second?
  • Queries Per Second (QPS): Used for databases. Conceptually similar to RPS but specific to reading/writing data
  • Bytes per second: Used for data pipelines
    • Useful when processing massive data not tied to specific user requests (e.g., processing 1 Terabyte of logs)
    • Helps calculate how long a job will take (e.g., processing 1TB at 1GB/sec takes 1,000 seconds)
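
The same arithmetic as a back-of-the-envelope sketch:

```python
# Job time = data size / throughput (decimal units for simplicity).
data_bytes = 1_000_000_000_000    # 1 TB of logs
bytes_per_second = 1_000_000_000  # pipeline throughput: 1 GB/sec

seconds = data_bytes / bytes_per_second
print(f"{seconds:,.0f} seconds (~{seconds / 60:.0f} minutes)")
# 1,000 seconds (~17 minutes)
```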

Scaling Strategies

Context: To improve availability and throughput, you can scale the system

  • Vertical scaling
    • Making the single server better (more RAM, better CPU)
    • Pros: Simple to design
    • Cons: Limited by hardware physics and remains a single point of failure
  • Horizontal scaling
    • Adding more servers (copies) rather than making one server stronger
    • Pros: Increases reliability and redundancy; virtually unlimited scaling potential
    • Cons: More complex. Requires a Load Balancer to distribute traffic and makes database management harder (distributed data)
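
A toy sketch of the load balancing that horizontal scaling requires (server names are made up); round-robin is only one of several possible strategies:

```python
from itertools import cycle

# Round-robin: hand each incoming request to the next server copy in turn.
servers = cycle(["server-1", "server-2", "server-3"])

def route(request):
    return next(servers)

for request in ["GET /a", "GET /b", "GET /c", "GET /d"]:
    print(request, "->", route(request))
# server-1, server-2, server-3, then back to server-1
```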

Latency

Concepts

  • The amount of time it takes for a single operation to complete (e.g., 1 second for a page to load)
  • Throughput vs. latency
    • Throughput = How many requests per second
    • Latency = How fast is one request
  • Reducing latency
    • Caching: Storing data in faster storage. Reading from a CPU cache (nanoseconds) is orders of magnitude faster than RAM (roughly 100 nanoseconds), which is in turn far faster than disk (milliseconds)
    • Geography: Distance matters. A user on the other side of the world experiences high latency
    • Solution: Place servers in different parts of the world so users connect to a server physically closer to them
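
A toy sketch of the caching idea above, with a sleep standing in for slow storage:

```python
import time

cache = {}  # fast in-memory store

def slow_read(key):
    time.sleep(0.05)  # stand-in for a ~50 ms disk or network read
    return f"value-for-{key}"

def read(key):
    if key in cache:          # cache hit: skip the slow storage entirely
        return cache[key]
    value = slow_read(key)    # cache miss: pay the latency once...
    cache[key] = value        # ...then remember the result
    return value

read("user:42")  # first read is slow (~50 ms)
read("user:42")  # second read is served from the cache
```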
